ABSTRACT

Hardware data prefetcher engines have been extensively used
to reduce the impact of memory latency. However, the hardware
prefetcher engines of current microprocessors do not include
any automatic hardware control able to dynamically tune their
operation. Because this architectural feature is missing,
systems operate with prefetchers in a fixed configuration,
which in many cases harms performance and energy consumption.

This paper presents a software solution to this problem in
the context of the IBM POWER7 microprocessor. The proposed
solution uses the runtime software as a bridge that
characterizes the workload of user applications and
dynamically reconfigures the prefetcher engine. The proposed
mechanism has been deployed over OmpSs, a state-of-the-art
task-based programming model. The paper shows significant
performance improvements over a representative set of
microbenchmarks and High Performance Computing (HPC)
applications.

1. INTRODUCTION 

Hardware data prefetching is a performance optimization
technique that helps to alleviate the so-called Memory
Wall [24] problem by taking advantage of the spatial locality
of applications when they access memory. Although some
contemporary processors come with a set of knobs that adjust
different parameters of the hardware prefetcher, tuning them
is left to the programmer, and they are set to a default
configuration when the system boots. Unfortunately, apart
from hurting application performance, in some cases this
default configuration can also waste power. For example,
prefetching a large amount of data on each memory request
may bring in unnecessary data that not only wastes power by
overloading memory bandwidth, but also pollutes the cache
hierarchy, potentially reducing the effective cache space,
which can impact performance in multicore environments.

The IBM POWER7 microprocessor lets the user enable or
disable the hardware prefetcher, tune the depth of each
prefetch operation, detect store prefetch streams, and
detect strides in data accesses, which are gaps of a given
fixed size in a data stream. This paper shows how different
applications can benefit from these knobs and makes evident
that the hardware prefetcher configuration cannot be left to
chance or to default values; it needs an algorithm that
balances power and performance depending on each application
workload. To provide this algorithm, which is placed in the
runtime, with the data needed to choose a configuration, a
dynamic mechanism that tracks the performance of
multithreaded workloads is constructed. Specifically, the
dynamic mechanism collects performance counters at task
level, which makes it possible to adjust the prefetcher
configuration for each code region delimited by the
programmer.

This paper is organized as follows: Section 2 describes
the main characteristics of the IBM POWER7 and the
reconfigurability of its prefetcher. Section 3 describes the
proposed dynamic mechanism that finds the best prefetcher
configuration at runtime. Next, Section 4 evaluates the
proposed solutions by analyzing performance metrics of
selected representative benchmarks. Section 5 summarizes the
related work and, finally, Section 6 presents the
conclusions of this paper.

2. BACKGROUND 

The IBM POWER7 is an 8-way issue superscalar symmetric
multiprocessor based on the Power Architecture. Its main
specifications include 8 cores with 4-way SMT and, for each
core, two separate 32KB L1 caches, one for data and one for
instructions, plus a 256KB L2 cache. Furthermore, there is an
on-chip 32MB shared L3 cache in which each core has its
private 4MB portion and can access the other portions at the
cost of higher latency. The reconfigurability of the IBM
POWER7 allows the end user to choose the SMT degree, which
can be set to single-thread, two-way or four-way. It is also
possible to change the priority of the decoded instructions
of each thread, and there are different knobs associated
with the hardware data prefetcher that control its operation
mode.

Table 1: Hardware prefetcher configurations

DSCR    Description         DSCR    Description
xx001   Off (disabled)      xx101   Deep
xx000   Default (Deep)      xx110   Deeper
xx010   Shallowest          xx111   Deepest
xx011   Shallow             x1xxx   Prefetch on stores
xx100   Medium              1xxxx   Stride-N

The hardware data prefetcher of the IBM POWER7 is
programmable per SMT hardware thread, which means that
there are 32 configuration registers (8 cores with 4 hardware
threads each) accessible from the Operating System (OS).
Each of them is called a Data Stream Control Register (DSCR).
They operate independently, so while one hardware thread is
prefetching data aggressively, another one can be running
with the prefetcher disabled. It is possible to enable or
disable the prefetcher engine in each thread as well as to
change the depth of each prefetch operation. Moreover, the
detection of store data streams and strided accesses can
also be enabled. Table 1 shows how to do so by writing the
DSCR. The first three bits (the three low-order bits in
Table 1) are called default prefetch depth (DPFD), and their
value selects how many lines each prefetch operation brings
from main memory to cache. The fifth bit, called Stride-N
Stream Enable (SNSE), only has an effect if the fourth bit,
called Store Stream Enable (SSE), is also set and the
hardware data prefetcher is enabled.
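As an illustration of the bit layout in Table 1, the following minimal sketch builds the low five bits of a DSCR value. The names Depth, encode_dscr and stride_n_effective are hypothetical helpers introduced here, not part of any IBM or OmpSs interface, and the sketch only computes the value; how it is written to the register is outside its scope.

```python
from enum import IntEnum


class Depth(IntEnum):
    """DPFD encodings (three low-order bits), as listed in Table 1."""
    DEFAULT = 0b000
    OFF = 0b001
    SHALLOWEST = 0b010
    SHALLOW = 0b011
    MEDIUM = 0b100
    DEEP = 0b101
    DEEPER = 0b110
    DEEPEST = 0b111


SSE_BIT = 1 << 3   # fourth bit: Store Stream Enable (x1xxx)
SNSE_BIT = 1 << 4  # fifth bit: Stride-N Stream Enable (1xxxx)


def encode_dscr(depth, store_streams=False, stride_n=False):
    """Return the low five DSCR bits for the requested configuration."""
    value = int(depth)
    if store_streams:
        value |= SSE_BIT
    if stride_n:
        value |= SNSE_BIT
    return value


def stride_n_effective(value):
    """Apply the rule stated in the text: SNSE only has an effect when
    SSE is also set and the prefetcher is not disabled."""
    depth = value & 0b111
    return bool(value & SNSE_BIT) and bool(value & SSE_BIT) and depth != Depth.OFF
```

For example, encode_dscr(Depth.DEEPEST, store_streams=True, stride_n=True) yields the pattern 11111, the most aggressive combination that Table 1 allows.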

In this paper, OmpSs, a state-of-the-art task-based
programming model, is used. This programming model,
similarly to the recent OpenMP 4.0 standard, lets the
programmer specify sequential regions of code together with
their data dependencies. These code regions are called
tasks and can run once their input and control dependencies
are satisfied. The OmpSs runtime system orchestrates the
parallel execution of the different tasks while making sure
that all the dependencies are satisfied.

3. ADAPTIVE PREFETCHER 

In this paper, an adaptive prefetcher mechanism able to
operate at runtime is proposed. Performance metrics
associated with the execution of the application are used to
choose the most suitable configuration. The mechanism
operates in two phases. During the exploration phase, each
prefetcher configuration is evaluated in terms of
performance improvement. During the stable phase, the best
prefetcher configuration found in the exploration phase is
used for a further number of consecutive tasks. A very
similar approach to hardware prefetcher reconfiguration at
runtime was recently presented by Jimenez et al. In their
work, a fixed time of 10ms for the exploration phases and
100ms for the stable phases was proposed. These values were
chosen empirically and aimed to mitigate the problem that
appears when one prefetcher configuration is chosen but the
application moves to another phase that can benefit more
from a different prefetcher configuration. However, their
mechanism selects the same prefetcher configuration for all
threads of an application in the stable phase. In this
paper, a more powerful technique is presented, which uses
the granularity of OmpSs tasks to characterize these phases
and allows different prefetcher configurations per task
type, even if the tasks run simultaneously.
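The two-phase scheme described above can be sketched as a small state machine kept per task type. This is a minimal illustration under stated assumptions, not the OmpSs runtime implementation: the class name PhaseController, its methods, and the use of the mean IPC as the selection metric are all hypothetical choices made for this example.

```python
from collections import defaultdict


class PhaseController:
    """Exploration/stable phase controller for one task type (sketch)."""

    def __init__(self, configs, explore_len, stable_len):
        if explore_len < 1 or stable_len < 1:
            raise ValueError("phase lengths must be at least one task instance")
        self.configs = list(configs)
        self.explore_len = explore_len  # task instances per configuration
        self.stable_len = stable_len    # task instances in the stable phase
        self._start_exploration()

    def _start_exploration(self):
        self.samples = defaultdict(list)
        self.idx = 0
        self.best = None
        self.stable_left = 0

    def next_config(self):
        """Configuration to apply before the next task instance runs."""
        return self.best if self.best is not None else self.configs[self.idx]

    def report(self, ipc):
        """Record the IPC of the task instance that has just finished."""
        if self.best is not None:  # stable phase: only count instances
            self.stable_left -= 1
            if self.stable_left == 0:
                self._start_exploration()
            return
        cfg = self.configs[self.idx]
        self.samples[cfg].append(ipc)
        if len(self.samples[cfg]) == self.explore_len:
            self.idx += 1  # move on to the next configuration
            if self.idx == len(self.configs):
                # Assumption: the best configuration has the highest mean IPC.
                self.best = max(
                    self.configs,
                    key=lambda c: sum(self.samples[c]) / len(self.samples[c]),
                )
                self.stable_left = self.stable_len


def make_per_task_type_controllers(configs, explore_len, stable_len):
    """One independent controller per task type, created on first use."""
    return defaultdict(lambda: PhaseController(configs, explore_len, stable_len))
```

A runtime would call next_config() before scheduling a task instance of a given type and report() when it finishes; keeping one controller per task type is what lets simultaneously running tasks use different configurations.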

The lengths of the exploration and stable phases are
important parameters of the adaptive mechanism. These
lengths are measured as the number of task instances that
are executed in the corresponding phase. The execution time
of each task instance is measured and stored internally in
the runtime system metadata. Regarding exploration phases,
it is necessary to run enough experiments to filter out the
measurement noise while keeping the exploration phase as
short as possible. The first experiments search for suitable
values for the lengths of the exploration and stable phases.
Section 4.2 explains in detail the exploration done in this
paper regarding the lengths of the exploration phases.

The impact of having different optimal prefetcher
configurations for different task types, instead of having a
task agnostic mechanism, is also evaluated in Section 4.3.
Different OmpSs tasks may have different kinds of workloads
and can therefore benefit more from different prefetcher
configurations. However, that difference may not be large
enough to compensate for the additional overhead of this
task type aware mechanism.

The trade-offs between performance improvement and power
consumption, in terms of memory bandwidth usage, are
explored in Section 4.4. The paper presents and evaluates a
solution based on a user-configurable parameter e that
determines what percentage of difference in IPC one
prefetcher configuration must show with respect to a less
aggressive one in order to be chosen as optimal. In this
case, it is important to see, for each application, the
relation between the aggressiveness of the prefetcher
configuration, the bandwidth used and the execution time.
When the adaptive prefetcher mechanism chooses aggressive
prefetcher configurations, more bandwidth is consumed.
However, consuming more bandwidth for a small performance
improvement may not be worthwhile.


4. RESULTS EVALUATION 

4.1 Experiments Setup 

The system used in this work is an IBM BladeCenter PS701, a
blade containing one socket with an 8-core IBM POWER7
running at 3.0 GHz. Although the POWER7 has two quad-channel
memory controllers, the PS701 uses a single memory
controller offering up to 40GB/s of bandwidth. The OS is
SUSE Linux Enterprise Server 11 SP3. Applications have been
compiled with the Mercurium 1.99.1 source-to-source
compiler, using IBM XL C/C++ 11.1 and IBM XL Fortran 13.1 as
back-end compilers. Prefetch instructions added by the
back-end compiler have been disabled, since only the dynamic
mechanism within the runtime is in charge of configuring the
prefetcher. The performance of the prefetcher configurations
has been monitored by collecting hardware counters with
PAPI. Because bandwidth results involve shared performance
counters, perf, the original implementation from the Linux
kernel, has been used for them.

Regarding the benchmarks, applications written in the OmpSs
programming model have been chosen from different sources,
together with codes written by the authors to stress the
system. The benchmarks are briefly described here.

• Dotproduct (DP): an OmpSs implementation of a dot
product of two vectors a and b with a stride K
($DP_K = \sum_i a[K \cdot i] \cdot b[K \cdot i]$). This
microbenchmark was specifically created to test the
hardware prefetcher in a controlled environment.

• Jacobi: it computes the solution of a linear system
obtained from a stencil scheme via the Jacobi iterative
method.

• K-means: it performs K-means clustering, that is, it
partitions n observations into k clusters in which each
observation belongs to the cluster with the nearest mean.

• Knn: an implementation of k-nearest neighbors, a
non-parametric machine learning method used for
classification and regression.

• Specfem3D: this application simulates 3D seismic wave
propagation in any region of the Earth based on the
spectral-element method.

• Heat: it solves linear systems that come from heat
distribution problems.

The Dotproduct benchmark is used to validate the expected
behavior of the IBM POWER7 prefetcher. With a linear access
pattern (K = 1), enabling the prefetcher halves the
execution time. When the stride equals the cache line size,
the aggressiveness of the prefetcher is critical: the
deepest prefetcher configuration obtains 5x speedups with
respect to disabling the prefetcher. When the stride is
larger than twice the size of the cache line, the SNSE bit
has to be set to observe performance improvements. Finally,
these benchmarks are not sensitive to the SSE bit, since
they accumulate the result in a single variable. If,
instead, the addition of two vectors is computed and stored
in a target vector, then the SSE bit significantly improves
performance when activated. In the remainder of the paper,
K = 1 is assumed for the Dotproduct benchmark.
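The two access patterns discussed above can be sketched as follows. This is only an illustration of the memory access pattern, written in Python for brevity; the actual microbenchmark is an OmpSs code, and the function names strided_dot and vector_add are hypothetical.

```python
def strided_dot(a, b, k=1):
    """Dot product over every k-th element: sum of a[k*i] * b[k*i].

    Only loads are streamed; the result is accumulated in one variable,
    which is why this pattern does not exercise store streams.
    """
    if k < 1:
        raise ValueError("stride must be a positive integer")
    total = 0.0
    for i in range(0, min(len(a), len(b)), k):
        total += a[i] * b[i]
    return total


def vector_add(a, b):
    """Element-wise addition stored in a target vector.

    Here every iteration writes a new element, producing the kind of
    store stream that the SSE bit is meant to detect.
    """
    target = [0.0] * min(len(a), len(b))
    for i in range(len(target)):
        target[i] = a[i] + b[i]
    return target
```

With k = 1 the loop touches consecutive elements; larger values of k skip elements and, once the gap exceeds a couple of cache lines, correspond to the case in which the text reports that the SNSE bit is needed.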

4.2 Impact of Phases Lengths 

The first step in deploying a successful adaptive technique
is to evaluate the impact of the phase lengths and find
suitable values for them. Exploration phases have to be long
enough to make sure that the phase is representative, so
that the optimal prefetcher configurations can be
extrapolated.

We tested Dotproduct, Jacobi, Specfem3D and Heat with 1 and
8 threads. For different lengths of the exploration phases,
we computed the relative error of the IPC measured in the
exploration phases under dynamic reconfiguration with
respect to executions in which the prefetcher configuration
is set beforehand.
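The metric can be written down as a one-line function. The exact formula is not given in the text, so the absolute-value form below, as well as the function name relative_ipc_error, is an assumption made for illustration.

```python
def relative_ipc_error(ipc_exploration, ipc_static):
    """Relative error of the IPC measured during exploration with respect
    to the IPC of a run with the same configuration set beforehand.

    Assumption: the error is taken as an absolute value and normalized
    by the static-run IPC.
    """
    if ipc_static <= 0:
        raise ValueError("reference IPC must be positive")
    return abs(ipc_exploration - ipc_static) / ipc_static
```

A value close to zero means that a short exploration phase already measures an IPC that is representative of a full run with that configuration.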

In general, the results for lengths of 2, 4, 8, 16, 32 and
2500 tasks per prefetcher configuration did not show a
significant improvement in the relative error, although a
few applications showed a noticeable drop in the error for
lengths of 8 and 16 tasks and above. The largest length
sometimes helped, but in general it caused a large increase
in the error, because many OmpSs applications do not have
that many task instances for some task types and it is
therefore not possible to try all prefetcher configurations.

We did not observe a high correlation between execution
times and exploration phase lengths of different orders of
magnitude. For this reason, we chose a length that allows
most OmpSs applications to execute several exploration
phases.

4.3 Impact of Classifying Task Types 

The next experiment compares the performance of two versions
of the dynamic mechanism. The first one distinguishes task
types when choosing the best prefetcher configuration, which
implies gathering statistics for each task type separately.
The second one treats all task types in the same way.

Separating task types can give better results because some
types may benefit more from a given prefetcher
configuration, whereas other task types may benefit more
from a different one. However, this difference might not be
enough to compensate for the additional complexity of
dealing with task types separately.

Figure 1: Performance speedup when classifying statistics per task type

Figure 1 shows the results of this experiment for different
applications and different numbers of requested threads.
Speedups are calculated with respect to the version that
does not classify by task type. Additionally, when the task
type aware approach is applied, an additional optimization
is deployed that saves the prefetcher configuration per
thread in the runtime metadata, which avoids consulting and
modifying the prefetcher status very often. The Dotproduct
benchmark has only one task type, so the speedup observed in
the 32-thread executions comes from this additional
optimization. Dotproduct was included in this set of
experiments to evaluate these extra benefits, which are
orthogonal to the task type aware approach. The rest of the
applications, which have more than one kind of task, show
different behaviors. While Jacobi and Specfem3D do not
present speedups, K-means, Knn and Heat show speedups
starting at a parallelism level of 1 thread. The classifying
version of the dynamic mechanism is therefore considered a
useful improvement, since it provides significant
performance benefits.
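The per-thread caching optimization mentioned above can be sketched as follows. The class CachedPrefetcherSetter and the write_dscr callback are hypothetical names; the callback stands in for whatever mechanism the runtime uses to write the register, which the text does not detail.

```python
class CachedPrefetcherSetter:
    """Remember the last configuration written for each hardware thread
    and skip the write when the requested value is already in place."""

    def __init__(self, write_dscr):
        # write_dscr(thread_id, value) is a hypothetical callback that
        # performs the (comparatively expensive) register update.
        self._write = write_dscr
        self._current = {}

    def apply(self, thread_id, value):
        """Return True if the register was written, False if skipped."""
        if self._current.get(thread_id) == value:
            return False
        self._write(thread_id, value)
        self._current[thread_id] = value
        return True
```

With a single task type, as in Dotproduct, consecutive task instances on a thread request the same configuration during the stable phase, so most calls to apply() skip the write.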

4.4 IPC Driven Power-Performance Optimization

Having determined acceptable lengths for the exploration and
stable phases for any OmpSs application, and with a dynamic
mechanism that classifies the performance statistics of
tasks by their type, the third experiment aims to save power
by reducing the aggressiveness of the prefetcher when
aggressiveness does not bring a considerable gain in
performance. This is done through a configurable parameter e
that represents, as a percentage, the required difference in
IPC between one aggressive configuration and a less
aggressive one. When the difference is smaller than e, the
more aggressive configuration is not considered to be
better. The method starts from the disabled prefetcher
configuration and goes through more aggressive
configurations step by step.
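A minimal sketch of this selection rule is shown below. It assumes that each candidate is compared against the configuration currently selected (rather than against its immediate predecessor), which the text does not specify; the function name select_config and the sample IPC values are hypothetical.

```python
def select_config(ipc_by_config, e_percent):
    """Pick a prefetcher configuration with an IPC-gain threshold.

    ipc_by_config: list of (name, ipc) pairs ordered from the least
    aggressive configuration (prefetcher disabled) to the most aggressive.
    e_percent: minimum IPC gain, in percent, that a more aggressive
    configuration must show over the currently selected one.
    """
    best_name, best_ipc = ipc_by_config[0]
    for name, ipc in ipc_by_config[1:]:
        gain = 100.0 * (ipc - best_ipc) / best_ipc
        if gain >= e_percent:  # a smaller gain is not considered better
            best_name, best_ipc = name, ipc
    return best_name


# Hypothetical measurements, for illustration only.
measured = [("off", 1.0), ("shallowest", 1.7), ("medium", 1.75), ("deepest", 1.78)]
```

With these made-up numbers, e = 0 selects the deepest configuration, e = 10 stops at the shallowest one, and e = 80 keeps the prefetcher disabled, since no step reaches the required gain.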

Figure 2 shows the execution slowdowns for e values of 0,
10, 20, 30, 40 and 80%. The Dotproduct results show a drop
in performance when e requires an 80% difference in IPC.
This 80% difference in IPC is observed when going from the
disabled prefetcher to the shallowest configuration, which
is the first step in increasing prefetcher aggressiveness.
Jacobi presents a slowdown of nearly 10% for the 1 and 8
thread configurations when e is set to 10%. Specfem3D,
interestingly, shows a drop in performance at nearly every
step that raises e, which reveals a strong correlation
between the depth of the prefetcher and the IPC obtained.

Regarding memory bandwidth usage, both Dotproduct and
Specfem3D show a reduction that is consistent with the
performance drop, meaning that in these cases the
applications fully exploit the extra bandwidth used by the
prefetcher. Jacobi, Knn and Heat do not show a consistent
reduction in the bandwidth used, and they do not suffer
performance slowdowns when less aggressive configurations
are chosen either. Finally, K-means does not suffer any
slowdown in execution time although its bandwidth usage is
significantly reduced as e increases. Therefore, in this
particular application, the adaptive mechanism successfully
selects the least aggressive prefetcher configuration that
provides maximum performance, avoiding useless memory
bandwidth consumption.


5. RELATED WORK 

Many works have dealt with data prefetching. The first
attempts were based on sequential prefetchers, an approach
that prefetches memory blocks sequentially. Although
effective for sequential accesses, this solution does not
improve performance when the application does not follow a
sequential data access pattern. For this reason, further
research on prefetchers tried to capture the non-sequential
nature of those applications. Prefetch techniques aimed at
pointer-based applications have been studied [6]. Solihin
et al. [21] made use of a user-level memory thread to do
prefetching, obtaining significant speedups in applications
with irregular accesses. Joseph and Grunwald worked on
Markov-based prefetchers. Although most of these works on
prefetching have not been put into practice with real
processors, limit studies and analytical prefetch models
have been proposed [7, 22].


A further step in data prefetching is to consider the
interaction between threads that takes place in CMP
processors. Ebrahimi et al. and Lee et al. study the effect
of thread interaction on prefetching and design prefetch
systems that improve throughput and fairness. Liu and
Solihin [15] present a study of the impact of prefetching
and bandwidth partitioning in CMPs. Although there are many
studies of data prefetching on top of simulators, there are
very few works that make use of real processors. For
instance, Wu and Martonosi [23] characterize the prefetcher
of an Intel Nehalem processor and provide a straightforward
algorithm that dynamically controls the activation and
deactivation of the prefetcher. Nevertheless, their work
only contemplates intra-application cache interference and
leaves out actual system performance. Liao et al. [14] build
a machine learning model that dynamically modifies the
prefetch configuration of the machines in a data center
(based on Intel Core2 processors). Their approach is also
based on turning the prefetcher on and off.

Figure 2: Performance speedup and bandwidth when e increases

Beyond enabling and disabling the prefetcher, there are
other kinds of works aimed at controlling the thread
execution rate. For example, fetch policies within an SMT
processor have been studied. These works aim to increase
throughput and/or provide quality of service (QoS). Along
the same lines, Boneti et al. study the usage of the dynamic
hardware priorities of the IBM POWER5 processor, aiming to
gain performance from resource balancing and
prioritization. Qureshi and Patt [19] study how to improve
throughput by solving the problem of partitioning the
last-level cache among multiple applications. Moreto et al.
show a similar solution based on achieving QoS for multiple
applications running at the same time.


6. CONCLUSIONS 

Contemporary microprocessors are being designed with
reconfigurability features and are increasingly capable of
counting different events by means of hardware counters. In
this paper, a portable solution implemented within a runtime
reconfigures the hardware prefetcher making use of hardware
counters. A dynamic mechanism makes the reconfiguration
process automatic. Once it has collected enough performance
data from the different configurations, it calls an
algorithm that determines which one is the most
power-performance efficient. This process is repeated
periodically during the execution of the application.

A series of experiments has shown that performance is nearly
insensitive to collecting large amounts of data from
performance counters; this can be attributed to the fact
that a few OmpSs tasks already provide performance data that
is representative enough. Additionally, classifying OmpSs
task types has a positive impact on performance, because
different task types may correspond to different kinds of
workloads and may therefore benefit from different
prefetcher configurations. Finally, a proposal for saving
power in the cases in which aggressive prefetcher
configurations do not come with a substantial speedup has
proved to be potentially useful and has given good results.
The underlying idea is to set an IPC percentage threshold
that limits the aggressiveness of the chosen prefetcher
configurations.